Compatibility Issue with Chinese Text in Document Parsing #3530

Coniferish · 2024-08-16T18:50:59Z

Duplicate of #3267 since forked PRs are failing to pass chipper CI tests

…apply to text type classification

- Added a `languages` attribute to the Document base class. Language-dependent behavior appears in many methods across the document, so a shared language list is needed as a default value. The attribute also partially meets the requirements of domain-driven design.
- Added a `languages` option to `DocxPartitionerOptions` to specify the list of languages used for text type classification.
- Modified `_DocxPartitioner.detect_text_type()` to use the specified languages, or to detect the languages automatically when "auto" is specified.
- This lets the partitioner classify text elements more accurately based on language, improving overall partitioning quality.
- For HTML and MD (MD uses the HTML partition method), the `languages` field is now passed through the entire construction chain until it is finally used in the `is_possible_narrative_text` and `is_possible_title` functions. These two functions already supported different judgments for different languages, but the `languages` parameter was not passed to them correctly, so the capability was never enabled. This update enables it.
- **BREAKING CHANGE**: The `DocxPartitionerOptions` constructor and some other partition functions now take a new `languages` parameter, which may break existing code. Because most parameters have default values, it is not entirely a breaking change; treat this as a warning.

The docx and md test cases have been re-run and pass, and simple test cases for the new feature have been submitted to confirm that it works correctly.
---

### feat(unstructured/partition/docx.py): add language detection and apply it to text type classification

The Chinese version of this description lists the same changes as the English description above, ending with the same note that the docx and md test cases were re-run and passed.
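The pass-through described in this PR can be pictured with a minimal, self-contained sketch. Everything in it is hypothetical: `detect_languages`, `resolve_languages`, `looks_like_title`, and `classify_text` are illustrative names, not functions from the library; the CJK range check is a toy stand-in for a real language detector; and the codes `"zho"` and `"eng"` are assumed.

```python
from typing import List


def detect_languages(text: str) -> List[str]:
    """Toy detector (assumption): any CJK ideograph means Chinese, else English."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    return ["zho"] if has_cjk else ["eng"]


def resolve_languages(text: str, languages: List[str]) -> List[str]:
    """Use the caller's languages, or detect them when "auto" is requested."""
    if languages == ["auto"]:
        return detect_languages(text)
    return list(languages)


def looks_like_title(text: str, languages: List[str]) -> bool:
    """Hypothetical leaf heuristic that finally consumes `languages`."""
    if not text.strip() or not languages:
        return False
    if "zho" in languages:
        # Chinese has no letter case, so rely on length and final punctuation.
        return len(text) <= 20 and not text.endswith("。")
    # For cased languages, treat short text without a final period as a title.
    return len(text.split()) <= 8 and not text.endswith(".")


def classify_text(text: str, languages: List[str]) -> str:
    """Resolve the languages once, then hand them down the call chain."""
    resolved = resolve_languages(text, languages)
    return "Title" if looks_like_title(text, resolved) else "NarrativeText"
```

The point of the sketch is only the data flow: the caller's `languages` value is resolved once and then reaches the innermost heuristic instead of being dropped along the way.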

…e code formatting

- Update CHANGELOG.md to include compatibility issue fix for Chinese text in document parsing.
- Reformat import statements in test_odt.py for better readability.
- Adjust import order in html.py to adhere to PEP8 guidelines.
- Add `languages` parameter to text processing functions in pdf.py and text.py for improved language handling.
- Reformat long lines to improve code readability and maintain consistency.

Co-authored-by: Your Name <[email protected]>

…cOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. Preliminary tests indicate the script works correctly on both Linux and MacOS.
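The platform check this commit message describes lives in a shell script that is not shown here. As a rough Python illustration of the same idea, and not the script itself, the sketch below picks `gsed` on macOS and `sed` elsewhere. The function names are hypothetical, and the `__version__ = "..."` line format assumed for version extraction is a guess at what the awk step reads.

```python
import platform
import re
import shutil
from typing import Optional


def pick_sed(system: Optional[str] = None) -> str:
    """Return the sed command to use: "gsed" on macOS, "sed" elsewhere."""
    system = system or platform.system()
    return "gsed" if system == "Darwin" else "sed"


def sed_is_available(system: Optional[str] = None) -> bool:
    """Report whether the chosen command can be found on PATH."""
    return shutil.which(pick_sed(system)) is not None


def extract_version(source: str) -> Optional[str]:
    """Pull the quoted value out of a `__version__ = "..."` line (assumed format)."""
    match = re.search(r"""__version__\s*=\s*["']([^"']+)["']""", source)
    return match.group(1) if match else None
```

Passing the system name explicitly, as in `pick_sed("Darwin")`, makes the branch easy to exercise on either platform.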

Added logic in the `test_weaviate_schema_is_valid` test function to check the existing Weaviate schema. If the class to be created already exists, the creation step is skipped and a corresponding message is printed to avoid creating a duplicate class.
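The skip-if-present check can be sketched without a running Weaviate instance. The sketch below assumes the schema is available as a plain dict with a `"classes"` list whose entries carry a `"class"` key; `class_exists` and `ensure_class` are hypothetical helpers, not code from the test, and they mutate an in-memory dict rather than calling a client.

```python
from typing import Any, Dict


def class_exists(schema: Dict[str, Any], class_name: str) -> bool:
    """Return True if the schema already lists a class with this name."""
    return any(entry.get("class") == class_name for entry in schema.get("classes", []))


def ensure_class(schema: Dict[str, Any], class_definition: Dict[str, Any]) -> bool:
    """Add the class unless it is already present; return True if it was created."""
    name = class_definition["class"]
    if class_exists(schema, name):
        print(f"Class {name!r} already exists, skipping creation.")
        return False
    schema.setdefault("classes", []).append(class_definition)
    return True
```

Calling `ensure_class` twice with the same definition creates the class once and skips the second call, which is the behavior the commit message describes for repeated test runs.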

# Conflicts:
#   CHANGELOG.md
#   examples/pgvector/pgvector.ipynb
#   examples/training/0-Core Concepts.ipynb
#   examples/training/1-Intro to Bricks.ipynb
#   examples/training/2-File Exploration.ipynb
#   examples/weaviate/weaviate.ipynb
#   test_unstructured/partition/test_auto.py
#   unstructured/documents/html.py

JIAQIA and others added 30 commits May 23, 2024 19:46

Handled problems that arose during processing when `languages` is `[""]`.

e2bbcc8

Ran the full test suite; the pass rate stays essentially in line with the main branch.
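One way to guard against a `languages` value of `[""]` is to normalize the list before it is used. The sketch below shows that idea only; `clean_languages` is a hypothetical helper and is not taken from this commit, whose actual change is not shown on this page.

```python
from typing import Iterable, List, Optional


def clean_languages(languages: Optional[Iterable[str]]) -> List[str]:
    """Drop empty or whitespace-only entries and trim the rest."""
    if languages is None:
        return []
    return [lang.strip() for lang in languages if lang and lang.strip()]
```

With this in place, `[""]` and `[]` both come out as an empty list, so later code has a single "no languages given" case to handle.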

lint check

7da3d35

1. Modify the "languages" parameter in the initialisation function of…

aacd9ca

… "DocxPartitionerOptions" to collapse into keyword arguments (kwargs). 2. Change "capitalizable_languages" to "non_capitalizable_languages" in the function "is_possible_narrative_text".

Merge branch 'main' into feature/zh_adaptation

c96ef2d

Update CHANGELOG.md

57f0afb

Merge remote-tracking branch 'origin/feature/zh_adaptation' into feat…

c8ce18f

…ure/zh_adaptation

# Conflicts:
#   CHANGELOG.md

Update CHANGELOG.md

5b07f6f

Merge branch 'main' into feature/zh_adaptation

e04e38d

Fix Language auto bug

f660e2e

"Fix incorrect narrative text detection"

2bfe800

This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. The corrected state should instead return 'False' for such situations.
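The corrected behavior can be illustrated with a small stand-in function. This is a sketch under stated assumptions, not the library's `is_possible_narrative_text`: the function name, the set of language codes, and the capitalization threshold are all hypothetical. Only the `non_capitalizable_languages` naming and the rule that an empty language list yields `False` come from the commit messages on this page.

```python
from typing import List

# Assumption: codes for languages whose scripts have no letter case.
NON_CAPITALIZABLE_LANGUAGES = frozenset({"zho", "jpn", "kor"})


def narrative_text_sketch(text: str, languages: List[str], cap_threshold: float = 0.5) -> bool:
    """Hypothetical stand-in for a language-aware narrative-text check."""
    if not languages:
        # Corrected behavior from the commit: an empty list means "not narrative".
        return False
    if not text.strip():
        return False
    uses_case = any(lang not in NON_CAPITALIZABLE_LANGUAGES for lang in languages)
    if uses_case:
        # Mostly capitalized words suggest a heading rather than narrative text.
        words = [w for w in text.split() if w[0].isalpha()]
        if words:
            ratio = sum(w[0].isupper() for w in words) / len(words)
            if ratio > cap_threshold:
                return False
    return True
```

The capitalization test is skipped only when every requested language is in the non-capitalizable set, so Chinese text is no longer judged by a rule that cannot apply to it.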

Merge remote-tracking branch 'origin/feature/zh_adaptation' into feat…

c25eadd

…ure/zh_adaptation

Merge branch 'refs/heads/main' into feature/zh_adaptation

2e5bdd6

# Conflicts:
#   test_unstructured/documents/test_html.py
#   unstructured/documents/base.py
#   unstructured/documents/html.py
#   unstructured/documents/xml.py
#   unstructured/partition/epub.py
#   unstructured/partition/html.py

Merge remote-tracking branch 'origin/feature/zh_adaptation' into feat…

1e489d0

…ure/zh_adaptation

"Fix incorrect narrative text detection"

4990143

This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. The corrected state should instead return 'False' for such situations.

"Fix incorrect narrative text detection"

b81cd0c

This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. The corrected state should instead return 'False' for such situations.

Merge branch 'refs/heads/main' into feature/zh_adaptation

c8017fd

# Conflicts:
#   unstructured/documents/html.py
#   unstructured/partition/html.py

Merge from main branch

31993ab

Merge branch 'main' into feature/zh_adaptation

338eded

Merge branch 'main' into feature/zh_adaptation

915f275

Merge remote-tracking branch 'origin/feature/zh_adaptation' into feat…

4f49450

…ure/zh_adaptation

bug fix:

bca6bb1

Fixed some formatting issues in the Chinese test documents.

doc:

afec02f

Add change log

doc:

36fb66c

Add change log

Merge remote-tracking branch 'refs/remotes/origin/main' into feature/…

96ca057

…zh_adaptation

# Conflicts:
#   CHANGELOG.md
#   unstructured/__version__.py

JIAQIA and others added 4 commits August 14, 2024 13:14

doc:

6dc2aed

Add change log

Merge branch 'refs/heads/main' into feature/zh_adaptation

7c0adf7

# Conflicts:
#   CHANGELOG.md

doc:

1a8adb3

Add change log

Merge branch 'main' into jj/zh_adaptation

660bd5b

Coniferish temporarily deployed to ci August 16, 2024 18:54 — with GitHub Actions Inactive

Coniferish had a problem deploying to ci August 16, 2024 18:54 — with GitHub Actions Failure

Coniferish temporarily deployed to ci August 16, 2024 18:54 — with GitHub Actions Inactive

Merge branch 'main' into jj/zh_adaptation

f0a6755

Coniferish mentioned this pull request Aug 17, 2024

Compatibility Issue with Chinese Text in Document Parsing <- Ingest test fixtures update #3534

Closed
